Role-similarity based comparison of directed networks. 
Complex networks are a valuable tool in the analysis of a broad range
of systems, and there is a demand for tools which enable the extraction of meaningful information and
allow the comparison between different systems. We present a novel measure of similarity between
nodes in different networks as a generalization of the concept of self-similarity. A similarity matrix is
assembled from the distances between feature vectors that contain the numbers of incoming and outgoing paths of all lengths for
each node. Hence, nodes operating in a similar flow environment are considered similar regardless of
network membership. We demonstrate that this method has the potential to be influential in tasks
such as assigning identity or function to uncharacterized nodes. In addition, an innovative application
of graph partitioning to the raw results extends the concept to the comparison of networks in terms
of their underlying role-structure.
The study of complex networks has experienced a
dramatic rise in popularity in recent years. Networks are a
valuable way of representing complex data from a broad 
range of systems [l|-y| • Prominent examples include the 
World Wide Web, protein interaction networks, food 
webs and political blogs. A network is a collection of
nodes (individuals, web pages, species, proteins, etc.) connected
by edges that represent interactions (friendships,
hyperlinks, predation, correlated behavior, etc.).

With the escalation in computational capability and 
high-throughput technologies, networks can contain mil- 
lions of nodes or more, and useful information must be 
gleaned by means of statistical properties rather than by 
the study of each individual. In response to the demand 
for data analysis techniques, a huge variety of measures 
have been proposed to quantify and compare properties 
of these systems. 

A key challenge in this area is the development of 
methods to obtain simplified representations or models 
of complex network structure that capture the important
characteristics. Initial models with random connections
do not display features common to many real-world
networks. For instance, many networks exhibit modular
structure, and numerous tools have been developed
to detect tightly knit communities of significantly related
nodes.

Almost all network research to date has focused on
simple network relations, in which edges are binary and
carry no information on the direction or value of each interaction.
For a large subset of data, however, the direction of interaction is
essential. Examples include predation in food webs, hyperlinks
in the World Wide Web, and systems involving
causality such as metabolic networks.

On the whole, the direct comparison between different 
networks has so far been restricted to general statements 
summarizing network statistics. Examples of these mea- 
sures include degree distribution and connectivity. There 
is far more to be uncovered and here we focus on the 
cross-network comparison of the role played by individual
nodes and groups of nodes within network structure.
We develop these methods
and demonstrate the implications of such an investigation.
First, this enables the identification of functionally
equivalent individuals within different networks.
Second, it gives rise to a measure
of network similarity based upon the underlying structure.

The subject of role similarity has been addressed from 
a number of angles and we take inspiration from a va- 
riety of sources. The social sciences have provided the 
impetus for much of this research with concepts such as 
centrality [1, 0], regular equivalence and block modeling
[10]. The research behind search engine algorithms
[13] is also highly relevant. From a graph-theoretical
perspective there have been approaches by Jeh and
Widom [3 . Blondel et al [l^ and Leicht, Holme and 
Newman [16]. Here we make a direct extension of our
recently developed method for self-similarity calculation
[17], the purpose being to measure the functional similarity
between nodes in different networks based upon
network connections.

Given that the concept of role is dependent on the flow 
of information through a system, the inclusion of edge di- 
rection is natural. Regardless of which network a node 
belongs to, a flow profile is compiled using powers of the 
relevant adjacency matrix. A scaling parameter tunes 
the relative importance of local and global information. 
For instance, using only the most local edge information, 
sources and sinks can be identified in any network. For
two nodes to be considered similar, their networks must
appear similar from their respective viewpoints; as more distant
information is included, a more detailed structure
emerges. Of course, a specific case is that in which the
two networks are the same and the method reduces to 
computing self-similarity. 

Method. The measure is defined as follows. Consider
two directed graphs A and B with $N_a$ and $N_b$ nodes and
adjacency matrices $A_a$ and $A_b$, which are in general asymmetric.
For either graph, with adjacency matrix $A$ and $N$ nodes, the number of outgoing paths of length $k$ for
node $i$ is given by the $i$-th coordinate of the vector $A^k \mathbf{1}$, written $[A^k \mathbf{1}]_i$,
where $\mathbf{1}$ is the $N \times 1$ vector of ones. Similarly, the number
of incoming paths of length $k$ for node $i$ is $[(A^T)^k \mathbf{1}]_i$.
Note that the case $k = 1$ corresponds to the out-degree
and in-degree which, from this perspective, represent the
number of paths of length one originating or terminating
at the node.

For each network we construct a matrix that compiles
the incoming and outgoing paths of all lengths up to a maximum
$K$ by appending the column vectors indexed by
path length and scaled by the factors $\beta^k$:

$$X = [\, v_1 \; v_2 \; \cdots \; v_K \;\; w_1 \; w_2 \; \cdots \; w_K \,],$$

where the column vectors are $v_k = (\beta A)^k \mathbf{1}$ and $w_k = (\beta A^T)^k \mathbf{1}$,
and $\beta = \alpha/\lambda_1$, with $\lambda_1$ the largest eigenvalue
of the adjacency matrix and $0 < \alpha < 1$. The parameter
$\alpha$ is a scaling factor that allows us to tune the
weight of the local environment (short paths) relative
to the global network structure (long paths). The presence
of the factors $\beta^k$ ensures the convergence of the
sequence of the columns due to the asymptotic limit

$$\lim_{k \to \infty} \frac{\| A^{k+1} \mathbf{1} \|}{\| A^k \mathbf{1} \|} = \lambda_1.$$

$K$ is defined as the point at which
the columns have converged [§]. 

Each row vector of $X_a$ and $X_b$ contains the flow profile
of a node in terms of the scaled number of incoming and
outgoing paths of all lengths starting and ending at that
node. The similarity between two nodes, regardless of
the network to which they belong, can be quantified by
the distance between the row vectors $x_{a,i}$ (row $i$ of $X_a$) and $x_{b,j}$ (row $j$ of $X_b$). A simple
choice of metric is the cosine distance, which leads to the
(generally) rectangular similarity matrix $Y$ defined by:

$$Y_{ij} = \frac{x_{a,i} \cdot x_{b,j}}{\| x_{a,i} \| \, \| x_{b,j} \|}, \qquad (1)$$

where element $Y_{ij}$ provides a normalized measure of
the closeness of the flow profiles of node $i$ in network A
and node $j$ in network B. Naturally, if one compares
a network with itself then the resulting matrix will be
square with a diagonal consisting of ones.
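
The flow-profile construction and cosine comparison described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `flow_profiles` and `cosine_similarity` are hypothetical, the largest eigenvalue `lam` is assumed to be supplied by the caller rather than computed, the truncation `K = 5` is fixed instead of being chosen by a convergence test, and the two toy networks are invented for the example.

```python
import math


def flow_profiles(A, alpha, lam, K):
    """Return one row per node: K scaled out-path counts, then K scaled in-path counts.

    A     : adjacency matrix (list of lists), A[i][j] = 1 for an edge i -> j
    alpha : scaling parameter, 0 < alpha < 1
    lam   : largest eigenvalue of A (assumed precomputed by the caller)
    K     : maximum path length (fixed truncation, an assumption of this sketch)
    """
    n = len(A)
    beta = alpha / lam
    At = [list(col) for col in zip(*A)]  # transpose of A

    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    v = [1.0] * n  # becomes (beta * A)^k 1 after k steps
    w = [1.0] * n  # becomes (beta * A^T)^k 1 after k steps
    columns_out, columns_in = [], []
    for _ in range(K):
        v = [beta * x for x in matvec(A, v)]
        w = [beta * x for x in matvec(At, w)]
        columns_out.append(v)
        columns_in.append(w)
    # Convert the list of columns into a list of rows (one row per node).
    return [list(row) for row in zip(*(columns_out + columns_in))]


def cosine_similarity(Xa, Xb):
    """Rectangular matrix of cosines between rows of Xa and rows of Xb."""
    def norm(x):
        return math.sqrt(sum(t * t for t in x))

    Y = []
    for xa in Xa:
        row = []
        for xb in Xb:
            denom = norm(xa) * norm(xb)
            dot = sum(p * q for p, q in zip(xa, xb))
            row.append(dot / denom if denom > 0 else 0.0)
        Y.append(row)
    return Y


# Toy network A: directed 3-cycle. Toy network B: the same cycle plus a
# sink (node 3) fed by node 2. Both have largest eigenvalue 1.
A = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
B = [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 1],
     [0, 0, 0, 0]]

Xa = flow_profiles(A, alpha=0.9, lam=1.0, K=5)
Xb = flow_profiles(B, alpha=0.9, lam=1.0, K=5)
Y = cosine_similarity(Xa, Xb)        # 3 x 4 rectangular similarity matrix
Y_self = cosine_similarity(Xb, Xb)   # 4 x 4, ones on the diagonal
```

In this toy case the sink of network B shares its incoming profile with the cycle nodes of network A but has no outgoing paths, so its similarity to each of them is $1/\sqrt{2}$.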

The above similarity matrix can also be calculated via 
an iterative algorithm based upon a definition of node 
similarity, making the method directly comparable to 
previously proposed algorithms and offering additional 
insight into the concept. 

A pair of similar nodes will be connected in the same
way to other pairs of similar nodes. Referring to figure
1, if c and d are similar, u is connected to c, and v
is connected to d in the same way, then we can say that
nodes u and v also have some similarity. The similarity
between two nodes is determined by both the nature of
their immediate connections ($A_a J A_b^T$) and the similarity
of their neighbours ($A_a Y A_b^T$). The primary source
of similarity is the immediate neighbourhood, and the relative
importance of the local connections in comparison
to the external information passed from a node's neighbours
is determined by tuning the parameter $\alpha$.




FIG. 1. The intuitive concept behind the iterations is that 
the similarity between nodes u and v is determined by their 
immediate connections and by the similarity between their 
neighbours. 

The result is the following iterative procedure for calculating
a matrix of similarity scores, $Y$, with $Y_0$ the
$N_a \times N_b$ matrix of zeros.

Considering outgoing connections, $Y(\mathrm{OUT})$ is defined
as the convergent term of the iteration:

$$Y_{n+1} = A_a \left( J + \frac{\alpha^2}{\lambda_a \lambda_b} Y_n \right) A_b^T, \qquad (2)$$

where $J$ is the $N_a \times N_b$ matrix of ones, $Y_0$ is the matrix
of zeros, and $\lambda_a$ and $\lambda_b$ are the largest eigenvalues of $A_a$ and $A_b$.
Likewise, for incoming connections, $Y(\mathrm{IN})$ is
defined as the convergent term of the iteration:

$$Y_{n+1} = A_a^T \left( J + \frac{\alpha^2}{\lambda_a \lambda_b} Y_n \right) A_b. \qquad (3)$$

The final similarity matrix is the sum of the
convergent terms, $Y(\mathrm{OUT}) + Y(\mathrm{IN})$, after a normalisation
procedure. Convergence is guaranteed by the presence of
the $\alpha^2 / (\lambda_a \lambda_b)$ term.

The normalisation procedure for rectangular matrices
is to divide each entry of the AB-similarity matrix,
$Y_{ij}(A, B)$, by the square roots of the corresponding diagonal entries of
the respective self-similarity matrices, $\sqrt{Y_{ii}(A, A)}$ and
$\sqrt{Y_{jj}(B, B)}$. Thus each entry is normalised by the value
that would be produced if the node were compared to
itself.
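
A minimal sketch of the iterative procedure and its normalisation is given below, again in plain Python. Several details are assumptions of the sketch rather than statements of the method: the weighting factor `alpha**2 / (lam_a * lam_b)` inside the iteration is chosen so that the result agrees with the path-count formulation, the iteration is run for a fixed `n_iter = 200` steps instead of testing for convergence, the largest eigenvalues are supplied by the caller, and all function names and the toy networks are hypothetical.

```python
import math


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(P, Q):
    Qt = transpose(Q)
    return [[sum(p * q for p, q in zip(row, col)) for col in Qt] for row in P]


def raw_similarity(Aa, Ab, alpha, lam_a, lam_b, n_iter=200):
    """Unnormalised Y(OUT) + Y(IN) after a fixed number of iterations."""
    na, nb = len(Aa), len(Ab)
    c = alpha ** 2 / (lam_a * lam_b)  # assumed weighting factor

    def iterate(left, right):
        Y = [[0.0] * nb for _ in range(na)]  # Y_0 = matrix of zeros
        for _ in range(n_iter):
            inner = [[1.0 + c * y for y in row] for row in Y]  # J + c * Y_n
            Y = matmul(matmul(left, inner), right)
        return Y

    Y_out = iterate(Aa, transpose(Ab))  # Aa (J + c Y) Ab^T
    Y_in = iterate(transpose(Aa), Ab)   # Aa^T (J + c Y) Ab
    return [[o + i for o, i in zip(ro, ri)] for ro, ri in zip(Y_out, Y_in)]


def normalised_similarity(Aa, Ab, alpha, lam_a, lam_b):
    """Divide each entry by the square roots of the self-similarity diagonals."""
    Y_ab = raw_similarity(Aa, Ab, alpha, lam_a, lam_b)
    Y_aa = raw_similarity(Aa, Aa, alpha, lam_a, lam_a)
    Y_bb = raw_similarity(Ab, Ab, alpha, lam_b, lam_b)
    result = []
    for i, row in enumerate(Y_ab):
        scaled = []
        for j, y in enumerate(row):
            scale = math.sqrt(Y_aa[i][i] * Y_bb[j][j])
            scaled.append(y / scale if scale > 0 else 0.0)
        result.append(scaled)
    return result


# Toy networks: a directed 3-cycle, and the same cycle with an extra sink
# (node 3) fed by node 2. Both have largest eigenvalue 1.
A = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
B = [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 1],
     [0, 0, 0, 0]]

Y_norm = normalised_similarity(A, B, alpha=0.9, lam_a=1.0, lam_b=1.0)
Y_norm_self = normalised_similarity(B, B, alpha=0.9, lam_a=1.0, lam_b=1.0)
```

With this choice of weighting factor, the normalised entry for a cycle node of A against the sink of B is $1/\sqrt{2}$, the value obtained from the cosine of the corresponding flow profiles, and the self-comparison has ones on the diagonal.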

This algorithmic formulation allows for simplified updated
computations in a format comparable to, yet functionally
distinct from, other methods [15]. The method described
above neatly reduces to the self-similarity measure
and, unlike previous methods, does not suffer
from issues arising from odd and even iterations [16].
Convergence of the algorithm is guaranteed naturally and
no prior assumptions regarding any similarity between
nodes are required. Structurally equivalent nodes will have
a similarity of one, regardless of the ordering of rows and
columns.

As an illustration, we compare
two small directed networks,
shown in figure 2. The resulting
rectangular similarity matrix is also shown in gray scale,
indicating groups of nodes from both networks which appear
to play a similar role in their respective environments.

The parameter $\alpha$ produces the most interesting
results at values close to 1. At lower values
we trivially compare only the degree of each node. Care
must be taken at the high end of the spectrum, as values
too close to 1 tend to be computationally prohibitive. As
a rule of thumb, values in the range $0.95 < \alpha < 0.99$ produce interesting and
stable results.
these networks: Bacillus subtilis (bsu), Listeria monocytogenes
(lmo), Escherichia coli (eco), Homo sapiens
(hsa), Mus musculus (mmu), Pseudomonas aeruginosa
(pae), Pyrobaculum aerophilum (pai) and Sulfolobus solfataricus
(sso).




FIG. 2. Small example networks and the resulting similarity
matrix for $\alpha = 0.99$, represented in gray scale (color bar: similarity score, from 0.1 to 1.0).



Results. Several alternatives exist to process the information
contained in the resulting similarity matrix for
larger, real-world networks. We choose two complementary
lines of attack: the extraction of quantitative observations
directly from the similarity matrix, and further
processing in the form of partitioning the rectangular matrix.

Direct use of matrix entries. The first approach is 
to simply use the raw information. This in itself is very 
informative. For instance we can identify the node(s) 
in network B that most closely match a specific node 
from network A. This can be applied to assign a putative 
function or classification to nodes in an uncharacterised 
network when compared to a network in which the node 
identity is known. 
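
As a small illustration of this direct use of the matrix, the sketch below picks, for each node of network A, the best-scoring node or nodes of network B. The function name `best_matches`, the tie tolerance, the 2 x 3 matrix and the labels are all made up for the example and are not taken from the data analysed here.

```python
def best_matches(Y, labels_b, tol=1e-12):
    """For each row of Y (a node of network A), return the top score and the
    labels of all nodes of network B that attain it (ties included)."""
    matches = []
    for row in Y:
        top = max(row)
        names = [labels_b[j] for j, y in enumerate(row) if top - y <= tol]
        matches.append((top, names))
    return matches


# Hypothetical similarity matrix (2 nodes in A, 3 labelled nodes in B).
Y = [[0.20, 0.95, 0.95],
     [0.70, 0.10, 0.40]]
labels_b = ["pyruvate", "ATP", "ADP"]

matches = best_matches(Y, labels_b)
for score, names in matches:
    print(score, names)
```

Returning every node that ties for the top score, rather than a single index, reflects the fact that several nodes in the reference network may play an indistinguishable role.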

The potential application of this method is demon- 
strated on a small selection of metabolic networks. The 
database compiled from KEGG [11] by Ma and Zeng ^ 
includes 80 fully sequenced organisms in an extensive and 
carefully revised bioreaction database. Crucially, these 
networks have been represented as directed graphs by 
taking into account the reversibility of reactions. This 
data has been found to exhibit core-periphery type structure
by various methods [11 [H [U

The identity and characteristics of each node (metabo- 
lite) is already well studied. We take a small selection of 



FIG. 3. (a) Similarity matrix between metabolic networks lmo
(b) and pai (c), rearranged to approximately maximize the diagonal,
with synonymous node pairs highlighted in red. The
distribution of matrix entries over all network pairs is shown
in histogram (d). In all four diagrams gray scale represents
all nodes and red represents the synonymous metabolite pairs.
(e) The mean similarity between synonymous nodes with respect
to the mean matrix entry, averaged over all network
pairs, increases with $\alpha$.

As the metabolic identity of each node is known, we
can compare the role played by synonymous nodes (i.e.
the same metabolite) across multiple networks. Figure
3 illustrates the concept for the metabolic networks lmo
(figure 3b) and pai (figure 3c). The metabolites common to all 8 networks
are highlighted in red. These two networks share
195 common metabolites. With the rows and columns of
the similarity matrix reordered to maximize the "diagonal"
and the synonymous metabolites highlighted in red,
figure 3a highlights two interesting observations. First,
on the whole, metabolites tend to have a high similarity
score with themselves (they are close to the diagonal).
Second, they are spread throughout the network, with
the complete range of roles represented.

Over all network pairs in our set, the distribution of
similarity scores for synonymous nodes (red) compared
to the distribution of all entries in the matrix (gray) is
shown in figure 3d. Synonymous nodes tend to have high
similarity scores, implying that most metabolites play
a similar role in each network. The average score for
synonymous metabolites across networks is significantly
higher (0.880 ± 0.093) than the overall average (0.628 ±
0.047).

Furthermore, on scanning through the complete range
of the scaling parameter $\alpha$, we find that the mean similarity
score of synonymous nodes with respect to the overall
matrix mean increases as $\alpha$ tends towards its maximum
(figure 3e). Although the local environment (largely the
degree) of each node is informative, there is much to be
gained from the inclusion of a global perspective.

From this part of the analysis, the role in a metabolic 
sense appears to be correlated with our definition in 
terms of network flow. This has clear and promising im- 
plications for the use of these techniques in function or 
identity prediction. 

Partitioning the similarity matrix. An additional 
level of processing can extract further information from 
the matrix by grouping nodes according to their similar- 
ity score. The method involves the simultaneous cluster- 
ing (co-clustering) of nodes in network A and nodes in 
network B 2^1 • It is a simple extension of the popu- 
lar spectral partitioning method, the normalised cut to 
a bipartite graph by using singular value decomposition. 


The result of partitioning the rectangular similarity 
matrix is two vectors describing the grouping of nodes in 
each network: a partition of network A as classified by 
comparison with network B and vice versa. It is possi- 
ble that structure may be revealed within a network via 
comparison with another that is not obvious with self- 
similarity analysis. 

This leads towards a measure of similarity between
networks. If we find that the partition vectors of both networks
as obtained from the rectangular similarity matrix
$Y(A, B)$ are comparable to the partition vectors obtained
from the respective self-similarity matrices ($Y(A, A)$ and
$Y(B, B)$), then we could consider the underlying role organization
of the two networks to be similar. As such
we propose a method of comparison between two networks
based upon the partition vectors obtained. We
utilize a well-studied quantity known as mutual information,
which measures how well two partition
vectors describe the same data.

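The normalised mutual information between partitions
A and B is given by:

$$MI(A, B) = \frac{-2 \sum_{i=1}^{c_A} \sum_{j=1}^{c_B} N_{ij} \log\left( \frac{N_{ij} N}{N_{i.} N_{.j}} \right)}{\sum_{i=1}^{c_A} N_{i.} \log\left( \frac{N_{i.}}{N} \right) + \sum_{j=1}^{c_B} N_{.j} \log\left( \frac{N_{.j}}{N} \right)}, \qquad (4)$$

where the matrix $N$ is defined as a confusion matrix
between partitions A and B. Rows of $N$ correspond to
clusters in partition A and columns to clusters in B. $N_{ij}$
is the number of objects in both cluster $i$ of partition A
and cluster $j$ of partition B. $N_{i.}$ and $N_{.j}$ are the row and
column sums respectively, $N$ without indices denotes the total number of objects,
and $c_A$ and $c_B$ are the numbers
of clusters in A and B. In the case where A and B are
identical, $MI = 1$. If A and B are independent then
$MI(A, B)$ tends towards its minimum, zero.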

So if the MI score between the two partitions of network
A (one found by clustering $Y(A, A)$ and the other
found by clustering $Y(A, B)$) and the equivalent MI score
for network B are both high, then one could conclude that
the two networks have a similar underlying role structure.

When performing partitioning, the number of clusters 
we partition into can be somewhat arbitrary, method de- 
pendent or user dependent. In fact by pre-defining the 
number of clusters the best match between two networks 
may be missed. Therefore we suggest that some flexibil- 
ity in the number of partitions will give us the best idea 
of how similar two matrices are in terms of their parti- 
tions. Although we apply spectral clustering it is feasible 
to use a number of different methods. 

For each similarity matrix an ensemble of partitions is
produced using spectral clustering into $c = 2, 3, \dots, c_{max}$
groups. $c_{max}$ is defined by the user and will depend
upon the size of the graph. The value should be
significantly smaller than the number of nodes in the network.
The MI score between each partition pair in the
two ensembles is calculated and the maximum MI score is
taken to be the similarity between the two matrices with
respect to all possible partitions of the network induced
by the matrices.
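
The comparison of two partition ensembles can be sketched as follows. The normalised mutual information is written here in its commonly used confusion-matrix form, which is assumed to be the one intended; the function names, the convention of returning 1 when both partitions consist of a single cluster, and the example label vectors are assumptions of the sketch. The partitions themselves are taken as given, so the clustering step is not shown.

```python
import math
from collections import Counter


def normalised_mutual_information(part_a, part_b):
    """part_a, part_b: sequences of cluster labels, one entry per object."""
    n = len(part_a)
    joint = Counter(zip(part_a, part_b))  # N_ij, non-zero entries only
    rows = Counter(part_a)                # row sums N_i.
    cols = Counter(part_b)                # column sums N_.j
    numerator = -2.0 * sum(
        nij * math.log(nij * n / (rows[i] * cols[j]))
        for (i, j), nij in joint.items()
    )
    denominator = (sum(ni * math.log(ni / n) for ni in rows.values())
                   + sum(nj * math.log(nj / n) for nj in cols.values()))
    if denominator == 0:  # both partitions contain a single cluster
        return 1.0        # convention assumed for this sketch
    return numerator / denominator


def max_mi(ensemble_a, ensemble_b):
    """Maximum MI score over all pairs of partitions from two ensembles."""
    return max(normalised_mutual_information(p, q)
               for p in ensemble_a for q in ensemble_b)


# Hypothetical ensembles of partitions of the same six nodes, e.g. one
# obtained from a self-similarity matrix and one from a rectangular matrix.
ensemble_self = [[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]]
ensemble_cross = [[0, 1, 0, 1, 0, 1], [1, 1, 1, 0, 0, 0]]

score = max_mi(ensemble_self, ensemble_cross)
```

Because the score depends only on how objects are grouped, two partitions that differ only in the names of their clusters score 1, which is why the first partition of `ensemble_self` and the second of `ensemble_cross` give the maximum here.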

We have observed from the self-similarity analysis of a
number of data types [17] that many networks can be described
using a coarse-grained reduced representation obtained by
partitioning the self-similarity matrix. We construct an
ensemble of directed networks for which the proportional
flow structure between groups and the group sizes are identical
to those of the original network. All other aspects of the
networks are random, and in this case we keep the total
number of edges the same. For example, the world trade
network [26], [27], previously analyzed in [17], divides into
three groups (figure 5(i)f). As a test of this reduced model,
figure 4 demonstrates that these surrogate networks have
significantly higher similarity with each other and with
the original network than completely random networks
with the same number of nodes and edges. Thus, using
the reduced flow representation we can create surrogates
with a similar global role-structure to the original network.
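
One possible way to generate such surrogates is sketched below. It is an illustration under explicit assumptions, not the exact procedure used here: the reduced model is taken to be a list of group sizes together with the number of edges between each ordered pair of groups, edges are placed uniformly at random within each block, and self-loops are excluded. The function name `surrogate_network` and the example numbers are hypothetical.

```python
import random


def surrogate_network(group_sizes, block_edges, seed=None):
    """Random directed network with prescribed group sizes and edge counts.

    group_sizes : list of group sizes
    block_edges : block_edges[g][h] = number of edges from group g to group h
    Returns an adjacency matrix (list of lists) without self-loops.
    Raises ValueError if a block is asked to hold more edges than it has
    available node pairs.
    """
    rng = random.Random(seed)
    starts = [0]
    for size in group_sizes:
        starts.append(starts[-1] + size)  # first node index of each group
    n = starts[-1]
    A = [[0] * n for _ in range(n)]
    for g, row in enumerate(block_edges):
        for h, m in enumerate(row):
            candidates = [(i, j)
                          for i in range(starts[g], starts[g + 1])
                          for j in range(starts[h], starts[h + 1])
                          if i != j]
            for i, j in rng.sample(candidates, m):  # m distinct node pairs
                A[i][j] = 1
    return A


# Hypothetical two-group reduced model: 2 + 3 nodes, 7 edges in total.
sizes = [2, 3]
edges = [[1, 4],
         [0, 2]]
S = surrogate_network(sizes, edges, seed=0)
```

Each call with a different seed gives a different member of the ensemble, while the group sizes, the block edge counts and therefore the total number of edges stay fixed.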

To illustrate the power of this method in distinguishing
between networks with different role-structure, we construct
an ensemble of example surrogate networks of the
same size (100 nodes), although networks of different sizes
can be compared. To construct networks with a structure akin to
that of metabolic networks we use the networks of eco,
pai and mmu as templates (figures 5(i)a-c). Foodweb
structure is well studied and various methods for the construction
of network models exist in the relevant literature.
We follow the widely accepted "niche" model [28]
and construct 10 networks. In addition, we use the reduced
representation of the St Marks ecological network
[29] as a model to construct surrogates (figure 5(i)d). We
also include surrogates from the world trade model network
(figure 5(i)f). For completeness we include a set of random
directed networks.

FIG. 4. Gray-scale representation of network similarity for
world trade data and surrogate networks (blocks labelled surrogate
networks, random networks and original network). The original world
trade network is reduced to a three-node flow representation
(figure 5(i)f). Surrogate networks with an identical number
of nodes and edges are constructed from this model, preserving
the relative flow between groups. The surrogates show
high similarity to each other and to the original network,
whereas the random networks do not.

Ten networks of each type are constructed as defined by
the respective models. The result of our calculations is a
70 × 70 MI matrix describing the similarity between each
network pair as defined by our measure. This matrix is
illustrated in figure 5(ii), together with an average of each
block. There is a high degree of similarity between the
model of the St Marks foodweb, the niche foodweb model
and the world trade model. The models of metabolic
networks are clearly more involved; however, they display significantly
higher similarity to each other than to the other
network types. As one would expect, completely random
directed networks display no discernible structure
and are dissimilar to all the network models, including
other random networks.

Conclusion. We have presented a generalization of
the concept of node similarity to nodes in
different networks. The environment of each node is compared
in terms of directional flow. The most useful results
are those in which the most global information is
included.

The raw results contained in the similarity matrices ap- 
plied to metabolic networks show a promising elevation 
in similarity score values between nodes that are known 
to represent the same metabolite in both networks. This 
is a good indication of the correlation between the similarity
score based upon flow structure and the biological
functional similarity.

FIG. 5. (i) Reduced models of networks are compared by partitioning
of the similarity matrices. Metabolic network models:
(a) mmu, (b) eco, (c) pai; foodweb models: (d) St Marks,
(e) niche model (not illustrated); (f) world trade data and (g)
random directed networks (not illustrated). The representations
have had edges with very low weight removed for clarity.
(ii) The resulting matrix of maximum mutual information
scores is displayed in gray scale along with a block-average
simplification.

An important extension developed in this work is a 
method by which to compare the underlying structure of 
networks by performing graph partitioning on the simi- 
larity matrices. We illustrate this method by application 
to a selection of network models. It is clearly shown that 
the networks constructed under the same regime display 
a high level of similarity in their underlying structure as 
analyzed by this method. 